Transformer Architecture - Self-Attention
This note explores one of the most crucial concepts in the Transformer architecture: self-attention. Self-attention is a mechanism that determines how every token in a given sequence relates to every other token. The concept is especially interesting to me because of its connection to computational neuroscience, seen in modern Hopfield networks[1] or Transformer implementations modeled with neurons and astrocytes[2]. My exploration here is primarily based on the famous paper 'Attention Is All You Need'[3] and the book 'Deep Learning: Foundations and Concepts'[4].
1. Modeling Token Interactions
To understand the self-attention process, we first need to understand how a transformer processes data. Mathematically, the input data to a transformer is a set of vectors \( \{ \mathbf{x}_n \} \), where \( n = 1, \dots, N \) and \( \mathbf{x}_n \in \mathbb{R}^{D} \). Each data vector represents a token.
The input to the transformer will be a matrix \( \mathbf{X} \) of dimensions \( N \times D \), where the \( n \)-th row is the transposed token vector \( \mathbf{x}_n^T \).
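As a small illustration (a NumPy sketch with arbitrary values; the variable names are my own), the input matrix can be assembled by stacking the token vectors row-wise:

```python
import numpy as np

N, D = 4, 8                                       # N tokens, each D-dimensional
tokens = [np.random.randn(D) for _ in range(N)]   # the set {x_n}

# Stack so that the n-th row of X is the transposed token vector x_n^T
X = np.stack(tokens, axis=0)
print(X.shape)   # (4, 8), i.e. N x D
```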
To capture the relationships between all tokens in the set, we need a mechanism that updates each token's representation based on the others. At this stage, before diving into the specifics of attention, we can simply view this as a function \( F \) that transforms an input matrix into an output matrix of the same size:

\[ \mathbf{X}' = F(\mathbf{X}), \qquad \mathbf{X}, \mathbf{X}' \in \mathbb{R}^{N \times D} \]
In the following sections, we will unpack this function \(F \) to see exactly how the self-attention mechanism computes this output.
2. Mathematical Assumptions of Attention
Given the mapping

\[ \mathbf{X} \mapsto \mathbf{X}' = F(\mathbf{X}), \]

we want the output matrix \( \mathbf{X}' \) to capture richer contextual information about the relationships between tokens.
Let's analyze how a single output token \( \mathbf{x}_n' \) can be constructed from the input data \( \mathbf{X} \). The first thing we notice is that to fully capture its context, it should draw information from all input tokens, not just a few of them. To achieve this mathematically, we can express the new token as a linear combination of all input tokens:

\[ \mathbf{x}_n' = \sum_{m=1}^{N} \alpha_{nm} \mathbf{x}_m, \]

where \( \alpha_{nm} \) is called the attention weight.
To give meaning to the linear combination, we can analyze it with a probabilistic approach.
The weights represent a discrete probability distribution over the input sequence. More precisely, \( \alpha_{nm} \) represents the probability that the updated representation of the \(n\)-th token integrates information from the \(m\)-th token.
With this approach, the new token representation \(\mathbf{x}_n'\) is defined as the expected value of the input tokens under this probability distribution. To form a valid probability distribution, the weights must satisfy two fundamental axioms:
- Non-negativity: \( \alpha_{nm} \ge 0 \).
- Summation to one: \( \sum_{m=1}^N \alpha_{nm} = 1 \).
These two requirements strongly suggest that the softmax function will appear in the derivation that follows.
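A quick numerical check (a sketch with arbitrary scores) confirms that softmax turns any real-valued vector into weights satisfying both axioms:

```python
import numpy as np

def softmax(z):
    # Subtracting the max is a standard numerical-stability trick;
    # it leaves the result mathematically unchanged.
    e = np.exp(z - np.max(z))
    return e / e.sum()

scores = np.array([2.0, -1.0, 0.5])    # arbitrary raw scores, possibly negative
alpha = softmax(scores)

print(np.all(alpha >= 0))              # non-negativity
print(np.isclose(alpha.sum(), 1.0))    # summation to one
```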
3. Defining Attention
Having established the fundamental properties of the attention weights, the next step is to define how to calculate them from the input data. To determine how closely token \( \mathbf{x}_n \) relates to token \( \mathbf{x}_m \), we must measure the mathematical similarity between their corresponding vectors.
There are multiple methods to measure vector similarity, including Euclidean distance and kernel functions. However, we will use the dot product, defined algebraically as the matrix multiplication \( \mathbf{x}_n^T \mathbf{x}_m \), following the architecture proposed in the paper Attention Is All You Need[3].
This method is advantageous for several reasons. Primarily, it requires fewer mathematical operations than computing the \( L_2 \) norm. Furthermore, as we will demonstrate later, we can mitigate the curse of dimensionality associated with the dot product by applying a scaling operation.
We define the raw similarity score between the \( n \)-th token and the \( m \)-th token as their dot product: \( \mathbf{x}_n^T \mathbf{x}_m \).
However, raw dot products can yield negative values, and their sum is not guaranteed to equal \( 1 \). To enforce the previously established probabilistic axioms, we apply the softmax function over all \( N \) tokens in the sequence. This yields the final attention weights:

\[ \alpha_{nm} = \frac{\exp\!\big(\mathbf{x}_n^T \mathbf{x}_m\big)}{\sum_{m'=1}^{N} \exp\!\big(\mathbf{x}_n^T \mathbf{x}_{m'}\big)} \]

Expressing this operation in matrix form yields:

\[ \mathbf{X}' = \text{Softmax}\big(\mathbf{X}\mathbf{X}^T\big)\,\mathbf{X}, \]
where the term \(\text{Softmax}\big(\mathbf{X}\mathbf{X}^T \big)\) represents a "naive" version of self-attention. Because this formulation currently lacks learnable parameters, the necessary next step is to introduce weight matrices into the model to enable training.
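This parameter-free formulation can be sketched directly in NumPy (with a row-wise softmax, so each output row is a proper expectation over the input tokens):

```python
import numpy as np

def naive_self_attention(X):
    """Softmax(X X^T) X with no learnable parameters (row-wise softmax)."""
    scores = X @ X.T                                # (N, N) dot-product similarities
    scores -= scores.max(axis=1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)   # each row is a distribution
    return weights @ X                              # expected value of input tokens

X = np.random.randn(4, 8)
X_out = naive_self_attention(X)
print(X_out.shape)   # same (N, D) shape as the input
```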
4. Model Parametrization
The current formulation of the self-attention model has a fundamental limitation: it lacks learnable parameters. This absence of trainable weights leads to two specific mathematical issues:
- Symmetry of weights: The basic dot product operation \( \mathbf{x}_n^T \mathbf{x}_m \) yields symmetric attention scores. Consequently, the attention token \( n \) pays to token \( m \) is forced to be identical to the attention token \( m \) pays to token \( n \). This contradicts the directed nature of relationships in sequence processing.
- Lack of feature extraction: The model cannot isolate specific features of a token that might be relevant for a given context. It is restricted to using the exact same raw input representations both for computing similarity and for constructing the final output.
To address the lack of parameters, we can apply a linear transformation to the input data \( \mathbf{X} \) using a learnable weight matrix \( \mathbf{U} \):

\[ \tilde{\mathbf{X}} = \mathbf{X}\mathbf{U} \]

Consequently, the self-attention equation takes the following form:

\[ \mathbf{X}' = \text{Softmax}\big(\mathbf{X}\mathbf{U}\mathbf{U}^T\mathbf{X}^T\big)\,\mathbf{X}\mathbf{U} \]
Although this modification resolves the feature extraction problem, a critical flaw remains: the attention weights are still symmetric. The term \( \mathbf{U}\mathbf{U}^T \) guarantees a symmetric output, meaning the attention score between token \( n \) and token \( m \) is identical to the score between token \( m \) and token \( n \).
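This symmetry is easy to verify numerically: with a single shared projection \( \mathbf{U} \), the pre-softmax score matrix \( \mathbf{X}\mathbf{U}\mathbf{U}^T\mathbf{X}^T \) always equals its own transpose (a sketch with arbitrary random matrices):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8))     # N = 4 tokens, D = 8
U = rng.standard_normal((8, 8))     # single shared projection matrix

scores = X @ U @ U.T @ X.T          # (N, N) scores before softmax
# True: the score token n gives token m equals the score token m gives token n
print(np.allclose(scores, scores.T))
```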
In natural language processing, dependencies between tokens are strictly asymmetric. For instance, consider the semantic relationship between the words square and rectangle. This association is fundamentally directional: every square is a rectangle, but not every rectangle is a square. A symmetric attention mechanism fails to capture this directional dependency.
To break the structural symmetry caused by the term \( \mathbf{U}\mathbf{U}^T \), we must project the input sequence matrix \( \mathbf{X} \in \mathbb{R}^{N \times D} \) into distinct feature spaces. We achieve this by introducing three independent, learnable weight matrices: \( \mathbf{W}^Q \in \mathbb{R}^{D \times D_k} \), \( \mathbf{W}^K \in \mathbb{R}^{D \times D_k} \), and \( \mathbf{W}^V \in \mathbb{R}^{D \times D_v} \).
The naming convention for these matrices is derived directly from the Information Retrieval paradigm in computer science:
- Query (\( \mathbf{Q} = \mathbf{X}\mathbf{W}^Q \in \mathbb{R}^{N \times D_k} \)): This matrix represents the tokens when they are seeking context. A query vector encodes the specific information a token requires from the rest of the sequence.
- Key (\( \mathbf{K} = \mathbf{X}\mathbf{W}^K \in \mathbb{R}^{N \times D_k} \)): This matrix represents the tokens when they are being evaluated by others. A key vector encodes the specific information a token contains.
- Value (\( \mathbf{V} = \mathbf{X}\mathbf{W}^V \in \mathbb{R}^{N \times D_v} \)): This matrix represents the actual content. A value vector contains the extracted features that will be incorporated into the final output.
By separating the similarity computation into Queries and Keys, the interaction becomes asymmetric. The dot product \( \mathbf{Q}\mathbf{K}^T \) results in an \( N \times N \) matrix of alignment scores in which the score token \( n \) assigns to token \( m \) is no longer constrained to equal the score token \( m \) assigns to token \( n \).
Substituting these distinct projections into our equation yields the standard, unscaled self-attention formula:

\[ \mathbf{X}' = \text{Softmax}\big(\mathbf{Q}\mathbf{K}^T\big)\,\mathbf{V} \]
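A sketch with separate (randomly initialized) query and key projections shows that the score matrix is no longer symmetric:

```python
import numpy as np

rng = np.random.default_rng(1)
N, D, D_k = 4, 8, 8
X = rng.standard_normal((N, D))
W_q = rng.standard_normal((D, D_k))    # independent learnable projections
W_k = rng.standard_normal((D, D_k))

Q, K = X @ W_q, X @ W_k
scores = Q @ K.T                       # (N, N) alignment scores
print(np.allclose(scores, scores.T))   # False: attention is now directional
```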
5. Scaled Dot-Product Attention
As previously mentioned, the dot product operation suffers from the curse of dimensionality. To understand this mathematically, we must analyze the variance of the dot product between a query vector \( \mathbf{q} \) and a key vector \( \mathbf{k} \).
Let us assume that the components of \( \mathbf{q} \) and \( \mathbf{k} \) are independent random variables with a mean of \(0\) and a variance of \(1\). This assumption is approximately justified in the Transformer architecture by standard weight initialization schemes and the repeated application of Layer Normalization, which keep the activations close to these statistical properties.
Under this assumption, we denote the components as \( q_d \) and \( k_d \), where \( \mathbb{E}[q_d] = \mathbb{E}[k_d] = 0 \) and \( \mathbb{E}[q_d^2] = \mathbb{E}[k_d^2] = 1 \). We can now derive the variance of their dot product over \( D_k \) dimensions:

\[ \text{Var}\big[\mathbf{q}^T\mathbf{k}\big] = \mathbb{E}\Bigg[\Big(\sum_{d=1}^{D_k} q_d k_d\Big)^{2}\Bigg] - \Bigg(\mathbb{E}\Big[\sum_{d=1}^{D_k} q_d k_d\Big]\Bigg)^{2} \]

Utilizing the linearity of expectation and the assumption of independence among the variables, we separate the expectations:

\[ \text{Var}\big[\mathbf{q}^T\mathbf{k}\big] = \sum_{d=1}^{D_k} \mathbb{E}\big[q_d^{2}\big]\,\mathbb{E}\big[k_d^{2}\big] + \sum_{d \neq d'} \mathbb{E}[q_d]\,\mathbb{E}[k_d]\,\mathbb{E}[q_{d'}]\,\mathbb{E}[k_{d'}] - \Bigg(\sum_{d=1}^{D_k} \mathbb{E}[q_d]\,\mathbb{E}[k_d]\Bigg)^{2} \]

Given that the mean \( \mathbb{E}[q_d] = \mathbb{E}[k_d] = 0 \) and the variance \( \mathbb{E}[q_d^2] = \mathbb{E}[k_d^2] = 1 \), the equation simplifies to:

\[ \text{Var}\big[\mathbf{q}^T\mathbf{k}\big] = \sum_{d=1}^{D_k} 1 \cdot 1 = D_k \]
This mathematical proof demonstrates that the variance of the dot product grows linearly with the dimensionality \( D_k \). Consequently, for large values of \( D_k \), the dot product yields extremely large magnitudes. These large values push the softmax function into regions where the gradients approach zero, impeding the training process.
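The linear growth is also easy to observe empirically by sampling unit-variance vectors at increasing dimensionality (sample counts chosen arbitrarily for this sketch):

```python
import numpy as np

rng = np.random.default_rng(42)
n_samples = 20_000

for D_k in (16, 64, 256):
    q = rng.standard_normal((n_samples, D_k))   # components: mean 0, variance 1
    k = rng.standard_normal((n_samples, D_k))
    dots = np.sum(q * k, axis=1)                # one dot product per sample pair
    print(D_k, dots.var())                      # empirical variance grows like D_k
```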
To counteract this growth and maintain a variance of \(1\), we divide the dot product by its standard deviation, \( \sqrt{D_k} \). Incorporating this scaling factor yields the final, complete formula for the attention mechanism as defined in the standard Transformer architecture:

\[ \text{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \text{Softmax}\!\left(\frac{\mathbf{Q}\mathbf{K}^T}{\sqrt{D_k}}\right)\mathbf{V} \]
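Putting everything together, here is a minimal NumPy sketch of the complete mechanism; the projection shapes follow Section 4, and the concrete dimensions are arbitrary:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Softmax(Q K^T / sqrt(D_k)) V, with a row-wise softmax."""
    D_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(D_k)              # scaled alignment scores, (N, N)
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ V                           # (N, D_v)

rng = np.random.default_rng(0)
N, D, D_k, D_v = 4, 8, 6, 5
X = rng.standard_normal((N, D))
W_q = rng.standard_normal((D, D_k))
W_k = rng.standard_normal((D, D_k))
W_v = rng.standard_normal((D, D_v))

out = scaled_dot_product_attention(X @ W_q, X @ W_k, X @ W_v)
print(out.shape)   # (4, 5), i.e. N x D_v
```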
6. Conclusion and Next Steps
In this text, we derived the Scaled Dot-Product Attention mechanism. Starting with the mathematical definition of tokens, we addressed the limitations of symmetric dot products by introducing the Query, Key, and Value matrices. Finally, we mathematically proved the necessity of the \( \sqrt{D_k} \) scaling factor to preserve gradients during training.
Our next step will be a brief exploration of Hopfield Networks to understand the mathematical relationship between self-attention and associative memory.
Afterward, we will build upon our current foundation by introducing Multi-Head Attention and detailing how these components integrate into the complete Transformer model.
Cited Sources
- H. Ramsauer et al., "Hopfield Networks is All You Need", 2020, arXiv. doi: 10.48550/ARXIV.2008.02217.
- L. Kozachkov, K. V. Kastanenka, and D. Krotov, "Building transformers from neurons and astrocytes", Proc. Natl. Acad. Sci., vol. 120, no. 34, p. e2219150120, Aug. 2023, doi: 10.1073/pnas.2219150120.
- A. Vaswani et al., "Attention Is All You Need", 2017, arXiv. doi: 10.48550/ARXIV.1706.03762.
- C. M. Bishop and H. Bishop, Deep Learning: Foundations and Concepts. Cham: Springer International Publishing, 2024. doi: 10.1007/978-3-031-45468-4.
Additional Resources
- PyTorch Documentation: v2.10.0